{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "654dc6eb422087a85df50aad9646e3c3b8692e5d"
   },
   "source": [
    "#                                                                                 BERT\n",
    "\n",
    "BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.\n",
    "\n",
    "Academic paper which describes BERT in detail and provides full results on a number of tasks can be found here: https://arxiv.org/abs/1810.04805.\n",
    "\n",
    "Github account for the paper can be found here: https://github.com/google-research/bert\n",
    "\n",
    "BERT is a method of pre-training language representations, meaning training of a general-purpose \"language understanding\" model on a large text corpus (like Wikipedia), and then using that model for downstream NLP tasks (like question answering). BERT outperforms previous methods because it is the first *unsupervised, deeply bidirectional *system for pre-training NLP."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "8d6fe072db4c37b40476a41be674be47312d6251"
   },
   "source": [
    "![](https://www.lyrn.ai/wp-content/uploads/2018/11/transformer.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "85bc6b83b518e604a855a149ff9a5ce44f991620"
   },
   "source": [
    "# Downloading all necessary dependencies\n",
    "You will have to turn on internet for that.\n",
    "\n",
    "This code is slightly modefied version of this colab notebook https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
    "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import zipfile\n",
    "from matplotlib import pyplot as plt\n",
    "%matplotlib inline\n",
    "import sys\n",
    "import datetime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
    "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a"
   },
   "outputs": [],
   "source": [
    "#downloading weights and cofiguration file for the model\n",
    "!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "8cf1f5466e25bbf042e0b6d55355d0bf7f02984e"
   },
   "outputs": [],
   "source": [
    "repo = 'model_repo'\n",
    "with zipfile.ZipFile(\"uncased_L-12_H-768_A-12.zip\",\"r\") as zip_ref:\n",
    "    zip_ref.extractall(repo)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "b4a8b81c1fb8b5fae20945a56fda981550c54342"
   },
   "outputs": [],
   "source": [
    "!ls 'model_repo/uncased_L-12_H-768_A-12'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "0595c54d1033b9b0e69c5e565a8063a32295c3ff"
   },
   "outputs": [],
   "source": [
    "!wget https://raw.githubusercontent.com/google-research/bert/master/modeling.py \n",
    "!wget https://raw.githubusercontent.com/google-research/bert/master/optimization.py \n",
    "!wget https://raw.githubusercontent.com/google-research/bert/master/run_classifier.py \n",
    "!wget https://raw.githubusercontent.com/google-research/bert/master/tokenization.py "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "68f9932a52af968d1e94d0e115fc06e3423e1d47"
   },
   "source": [
    "Example below is done on preprocessing code, similar to **CoLa**:\n",
    "\n",
    "The Corpus of Linguistic Acceptability is\n",
    "a binary single-sentence classification task, where \n",
    "the goal is to predict whether an English sentence\n",
    "is linguistically “acceptable” or not\n",
    "\n",
    "You can use pretrained BERT model for wide variety of tasks, including classification.\n",
    "The task of CoLa is close to the task of Quora competition, so I thought it woud be interesting to use that example.\n",
    "Obviously, outside sources aren't allowed in Quora competition, so you won't be able to use BERT to submit a prediction.\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "0f47047973d523029fbe9c355577133c4656ce57"
   },
   "outputs": [],
   "source": [
    "# Available pretrained model checkpoints:\n",
    "#   uncased_L-12_H-768_A-12: uncased BERT base model\n",
    "#   uncased_L-24_H-1024_A-16: uncased BERT large model\n",
    "#   cased_L-12_H-768_A-12: cased BERT large model\n",
    "#We will use the most basic of all of them\n",
    "BERT_MODEL = 'uncased_L-12_H-768_A-12'\n",
    "BERT_PRETRAINED_DIR = f'{repo}/uncased_L-12_H-768_A-12'\n",
    "OUTPUT_DIR = f'{repo}/outputs'\n",
    "print(f'***** Model output directory: {OUTPUT_DIR} *****')\n",
    "print(f'***** BERT pretrained directory: {BERT_PRETRAINED_DIR} *****')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "3a2a0d12de6ecb82812a36549ea910b5a55c0c4a"
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "train_df =  pd.read_csv('../input/train.csv')\n",
    "train_df = train_df.sample(2000)\n",
    "\n",
    "train, test = train_test_split(train_df, test_size = 0.1, random_state=42)\n",
    "\n",
    "train_lines, train_labels = train.question_text.values, train.target.values\n",
    "test_lines, test_labels = test.question_text.values, test.target.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "b8804d16e48636d8a5c1474c4384076b16c73adc"
   },
   "outputs": [],
   "source": [
    "import modeling\n",
    "import optimization\n",
    "import run_classifier\n",
    "import tokenization\n",
    "import tensorflow as tf\n",
    "\n",
    "\n",
    "def create_examples(lines, set_type, labels=None):\n",
    "#Generate data for the BERT model\n",
    "    guid = f'{set_type}'\n",
    "    examples = []\n",
    "    if guid == 'train':\n",
    "        for line, label in zip(lines, labels):\n",
    "            text_a = line\n",
    "            label = str(label)\n",
    "            examples.append(\n",
    "              run_classifier.InputExample(guid=guid, text_a=text_a, text_b=None, label=label))\n",
    "    else:\n",
    "        for line in lines:\n",
    "            text_a = line\n",
    "            label = '0'\n",
    "            examples.append(\n",
    "              run_classifier.InputExample(guid=guid, text_a=text_a, text_b=None, label=label))\n",
    "    return examples\n",
    "\n",
    "# Model Hyper Parameters\n",
    "TRAIN_BATCH_SIZE = 32\n",
    "EVAL_BATCH_SIZE = 8\n",
    "LEARNING_RATE = 2e-5\n",
    "NUM_TRAIN_EPOCHS = 3.0\n",
    "WARMUP_PROPORTION = 0.1\n",
    "MAX_SEQ_LENGTH = 128\n",
    "# Model configs\n",
    "SAVE_CHECKPOINTS_STEPS = 1000 #if you wish to finetune a model on a larger dataset, use larger interval\n",
    "# each checpoint weights about 1,5gb\n",
    "ITERATIONS_PER_LOOP = 1000\n",
    "NUM_TPU_CORES = 8\n",
    "VOCAB_FILE = os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt')\n",
    "CONFIG_FILE = os.path.join(BERT_PRETRAINED_DIR, 'bert_config.json')\n",
    "INIT_CHECKPOINT = os.path.join(BERT_PRETRAINED_DIR, 'bert_model.ckpt')\n",
    "DO_LOWER_CASE = BERT_MODEL.startswith('uncased')\n",
    "\n",
    "label_list = ['0', '1']\n",
    "tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=DO_LOWER_CASE)\n",
    "train_examples = create_examples(train_lines, 'train', labels=train_labels)\n",
    "\n",
    "tpu_cluster_resolver = None #Since training will happen on GPU, we won't need a cluster resolver\n",
    "#TPUEstimator also supports training on CPU and GPU. You don't need to define a separate tf.estimator.Estimator.\n",
    "run_config = tf.contrib.tpu.RunConfig(\n",
    "    cluster=tpu_cluster_resolver,\n",
    "    model_dir=OUTPUT_DIR,\n",
    "    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,\n",
    "    tpu_config=tf.contrib.tpu.TPUConfig(\n",
    "        iterations_per_loop=ITERATIONS_PER_LOOP,\n",
    "        num_shards=NUM_TPU_CORES,\n",
    "        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))\n",
    "\n",
    "num_train_steps = int(\n",
    "    len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)\n",
    "num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)\n",
    "\n",
    "model_fn = run_classifier.model_fn_builder(\n",
    "    bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),\n",
    "    num_labels=len(label_list),\n",
    "    init_checkpoint=INIT_CHECKPOINT,\n",
    "    learning_rate=LEARNING_RATE,\n",
    "    num_train_steps=num_train_steps,\n",
    "    num_warmup_steps=num_warmup_steps,\n",
    "    use_tpu=False, #If False training will fall on CPU or GPU, depending on what is available  \n",
    "    use_one_hot_embeddings=True)\n",
    "\n",
    "estimator = tf.contrib.tpu.TPUEstimator(\n",
    "    use_tpu=False, #If False training will fall on CPU or GPU, depending on what is available \n",
    "    model_fn=model_fn,\n",
    "    config=run_config,\n",
    "    train_batch_size=TRAIN_BATCH_SIZE,\n",
    "    eval_batch_size=EVAL_BATCH_SIZE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "b69a1b734026d0b4b16c3eeb405e920195183873"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Note: You might see a message 'Running train on CPU'. \n",
    "This really just means that it's running on something other than a Cloud TPU, which includes a GPU.\n",
    "\"\"\"\n",
    "\n",
    "# Train the model.\n",
    "print('Please wait...')\n",
    "train_features = run_classifier.convert_examples_to_features(\n",
    "    train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)\n",
    "print('***** Started training at {} *****'.format(datetime.datetime.now()))\n",
    "print('  Num examples = {}'.format(len(train_examples)))\n",
    "print('  Batch size = {}'.format(TRAIN_BATCH_SIZE))\n",
    "tf.logging.info(\"  Num steps = %d\", num_train_steps)\n",
    "train_input_fn = run_classifier.input_fn_builder(\n",
    "    features=train_features,\n",
    "    seq_length=MAX_SEQ_LENGTH,\n",
    "    is_training=True,\n",
    "    drop_remainder=True)\n",
    "estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)\n",
    "print('***** Finished training at {} *****'.format(datetime.datetime.now()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "58fb06968c09872ccd74b44e07b4a6e932beb6df",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "There is a weird bug in original code.\n",
    "When predicting, estimator returns an empty dict {}, without batch_size.\n",
    "I redefine input_fn_builder and hardcode batch_size, irnoring 'params' for now.\n",
    "\"\"\"\n",
    "\n",
    "def input_fn_builder(features, seq_length, is_training, drop_remainder):\n",
    "  \"\"\"Creates an `input_fn` closure to be passed to TPUEstimator.\"\"\"\n",
    "\n",
    "  all_input_ids = []\n",
    "  all_input_mask = []\n",
    "  all_segment_ids = []\n",
    "  all_label_ids = []\n",
    "\n",
    "  for feature in features:\n",
    "    all_input_ids.append(feature.input_ids)\n",
    "    all_input_mask.append(feature.input_mask)\n",
    "    all_segment_ids.append(feature.segment_ids)\n",
    "    all_label_ids.append(feature.label_id)\n",
    "\n",
    "  def input_fn(params):\n",
    "    \"\"\"The actual input function.\"\"\"\n",
    "    print(params)\n",
    "    batch_size = 32\n",
    "\n",
    "    num_examples = len(features)\n",
    "\n",
    "    d = tf.data.Dataset.from_tensor_slices({\n",
    "        \"input_ids\":\n",
    "            tf.constant(\n",
    "                all_input_ids, shape=[num_examples, seq_length],\n",
    "                dtype=tf.int32),\n",
    "        \"input_mask\":\n",
    "            tf.constant(\n",
    "                all_input_mask,\n",
    "                shape=[num_examples, seq_length],\n",
    "                dtype=tf.int32),\n",
    "        \"segment_ids\":\n",
    "            tf.constant(\n",
    "                all_segment_ids,\n",
    "                shape=[num_examples, seq_length],\n",
    "                dtype=tf.int32),\n",
    "        \"label_ids\":\n",
    "            tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),\n",
    "    })\n",
    "\n",
    "    if is_training:\n",
    "      d = d.repeat()\n",
    "      d = d.shuffle(buffer_size=100)\n",
    "\n",
    "    d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)\n",
    "    return d\n",
    "\n",
    "  return input_fn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "20a827baca921e1b48ac3de20422e6d384e2e450"
   },
   "outputs": [],
   "source": [
    "predict_examples = create_examples(test_lines, 'test')\n",
    "\n",
    "predict_features = run_classifier.convert_examples_to_features(\n",
    "    predict_examples, label_list, MAX_SEQ_LENGTH, tokenizer)\n",
    "\n",
    "predict_input_fn = input_fn_builder(\n",
    "    features=predict_features,\n",
    "    seq_length=MAX_SEQ_LENGTH,\n",
    "    is_training=False,\n",
    "    drop_remainder=False)\n",
    "\n",
    "result = estimator.predict(input_fn=predict_input_fn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "ee2e686b144528d8eb5a2c50f98c52816a14c657"
   },
   "outputs": [],
   "source": [
    "from tqdm import tqdm\n",
    "preds = []\n",
    "for prediction in tqdm(result):\n",
    "    for class_probability in prediction:\n",
    "      preds.append(float(class_probability))\n",
    "\n",
    "results = []\n",
    "for i in tqdm(range(0,len(preds),2)):\n",
    "  if preds[i] < 0.9:\n",
    "    results.append(1)\n",
    "  else:\n",
    "    results.append(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "4acc876dc974dd4d93525e3b163859ddfda33b2d"
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.metrics import f1_score\n",
    "\n",
    "print(accuracy_score(np.array(results), test_labels))\n",
    "print(f1_score(np.array(results), test_labels))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "49299e2de55fc4136689cc43905fbc4690060c65"
   },
   "source": [
    "There are several downsides for BERT at this moment:\n",
    "\n",
    "- Training is expensive. All results on the paper were fine-tuned on a single Cloud TPU, which has 64GB of RAM. It is currently not possible to re-produce most of the BERT-Large results on the paper using a GPU with 12GB - 16GB of RAM, because the maximum batch size that can fit in memory is too small. \n",
    "\n",
    "- At the moment BERT supports only English, though addition of other languages is expected.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "a3cbd0cdf0b747e13c427d07f295905c38c813ec"
   },
   "source": [
    "# Competition test\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "ce1d0b361bfcaf9d58334b0f8facce9944ceadff"
   },
   "source": [
    "I've run a test with 30% of Quora data on free colab cloud TPU and achieved f1 score of 0.684.(11th place at the moment)\n",
    "\n",
    "**You can't use BERT in the competition, the notebook will fail when it comes to real testing.\n",
    "I apologies for littering the leaderboard, but I couldn't believe that local test where so high, I had to check.**\n",
    "\n",
    "Training took about 30-40 minutes.\n",
    "Results are really amazing, espetially because it's a raw model on random 30% sample with no optimization or ensamble, using the simlest of 3 released models.\n",
    "I didn't even have to preprocess anything, model does it for you.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "f876fa03d1b5b811c5fb3a531dad2cfb6d61031c"
   },
   "source": [
    "<img src=\"https://image.ibb.co/gGCZ0A/image.jpg\" alt=\"image\" border=\"0\">"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
